4. Sharding
So far, you've seen patterns that implement a single connection at a time. In a shard (see Figure 4),
multiple databases can be accessed simultaneously in a read and/or
write fashion and can be located in a mixed environment (local and
cloud). However, keep in mind that when a shard includes local databases, its overall
availability depends in part on the availability of those local databases.
Shards are typically
implemented when performance requirements are such that data access
needs to be spread over multiple databases in a scale-out approach.
4.1. Shard Concepts and Methods
Before visiting the
shard patterns, let's analyze the various aspects of shard design. Some
important concepts are explained here:
Decision rules. Logic that determines unambiguously which database contains the record(s) of interest. For example, if Country = US, then connect to SQL Azure Database #1.
Rules can be static (hardcoded in C#, for example) or dynamic (stored
in XML configuration files). Static rules tend to limit the ability to
grow the shard easily, because adding a new database is likely to
change the rules. Dynamic rules, on the other hand, may require the
creation of a rule engine. Not all shard libraries use decision rules.
Round-robin.
A method that changes the database endpoint for every new connection
(or other condition) in a consistent manner. For example, when
accessing a group of five databases in a round-robin manner, the first
connection is made to database 1, the second to database 2, and so on.
Then, the sixth connection is made to database 1 again, and so forth.
Round-robin methods avoid the creation of decision engines and attempt
to spread the data and the load evenly across all databases involved in
a shard.
Horizontal partition.
A collection of tables with similar schemas that represent an entire
dataset when concatenated. For example, sales records can be split by
country, where each country is stored in a separate table. You can
create a horizontal partition by applying decision rules or using a
round-robin method. When using a round-robin method, no logic helps
identify which database contains the record of interest; so all
databases must be searched.
Vertical partition.
A table schema split across multiple databases. As a result, a single
record's columns are stored on multiple databases. Although this is
considered a valid technique, vertical partitioning isn't explored in
this book.
Mirrors.
An exact replica of a primary database (or a large portion of the
primary database that is of interest). Databases in a mirror
configuration obtain their data at roughly the same time using a
synchronization mechanism like SQL Data Sync. For example, in a mirror
shard made of two databases, each of which has the Sales table, both
tables hold the same records once synchronization completes. Read operations are
then simplified (no rules needed) because it doesn't matter which
database you connect to; the Sales table contains the data you need in
all the databases.
Shard definition.
A list of SQL Azure databases created in a server in Azure. The
consumer application can automatically detect which databases are part
of the shard by connecting to the master database. If all databases
created are part of the shard, enumerating the records in sys.databases gives you all the databases in the shard.
Breadcrumbs.
A technique that leaves a small trace that can be used downstream for
improved decisions. In this context, breadcrumbs can be added to
datasets to indicate which database a record came from. This helps in
determining which database to connect to in order to update a record
and avoids spreading requests to all databases.
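
The decision-rule, round-robin, and breadcrumb concepts above can be sketched in a few lines. The following Python example is an illustration only, not the code of any shard library mentioned in this article. It assumes in-memory SQLite databases as stand-ins for SQL Azure databases, and every name in it (SHARD, COUNTRY_RULES, database_for_country, next_database, the _source_db breadcrumb key) is hypothetical.

```python
import itertools
import sqlite3

# Assumption: in-memory SQLite databases stand in for the shard's SQL Azure databases.
SHARD = {name: sqlite3.connect(":memory:") for name in ("db1", "db2", "db3")}
for conn in SHARD.values():
    conn.execute(
        "CREATE TABLE sales (id INTEGER PRIMARY KEY, country TEXT, amount REAL)"
    )

# Decision rule (static and hypothetical): the country decides the database.
COUNTRY_RULES = {"US": "db1", "FR": "db2"}
DEFAULT_DB = "db3"


def database_for_country(country):
    """Return the name of the database that holds records for this country."""
    return COUNTRY_RULES.get(country, DEFAULT_DB)


# Round-robin: hand out database names in a fixed, repeating order.
_round_robin = itertools.cycle(sorted(SHARD))


def next_database():
    """Return the next database name in the rotation."""
    return next(_round_robin)


def read_all_with_breadcrumbs():
    """Query every database and tag each row with the database it came from."""
    rows = []
    for name, conn in SHARD.items():
        cursor = conn.execute("SELECT id, country, amount FROM sales")
        for sale_id, country, amount in cursor:
            rows.append(
                {"id": sale_id, "country": country, "amount": amount,
                 "_source_db": name}  # the breadcrumb
            )
    return rows


def update_amount(row, new_amount):
    """Use the breadcrumb to update only the database the row came from."""
    conn = SHARD[row["_source_db"]]
    conn.execute(
        "UPDATE sales SET amount = ? WHERE id = ?", (new_amount, row["id"])
    )
```

With round-robin inserts, nothing in the record itself says where it was stored, so the breadcrumb is what lets update_amount touch one database instead of all three.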
When using a shard,
a consumer typically issues CRUD (create, read, update, and delete)
operations. Each operation has unique properties depending on the
approach chosen. Table 1
outlines some possible combinations of techniques to help you decide
which sharding method is best for you. The left column describes the
connection mechanism used by the shard, and the top row identifies the
shard's storage mechanism.
Table 1. Shard access techniques
| | Horizontal partitions | Mirror |
|---|---|---|
| Decision rules | Rules determine how evenly records are spread in the shard. Create: Apply rules. Read: Connect to all databases with rules included as part of a WHERE clause, or choose a database based on the rules. Add breadcrumbs for update and delete operations. Update: Apply rules or use breadcrumbs, and possibly move records to another database if the column updated is part of the rule. Delete: Apply rules, or use breadcrumbs when possible. | This combination doesn't seem to provide a benefit. Mirrored databases aren't partitioned, and so no rule exists to find a record. |
| Round-robin | Records are placed in databases in rotation, based on the current connection. No logic can be applied to determine which database contains which records. Create: Insert a record in the current database. Read: Connect to all databases, issue statements, and concatenate resultsets. Add breadcrumbs for update and delete operations. Update: Connect to all databases (or use breadcrumbs), and apply updates using a primary key. Delete: Same as update. | All records are copied to all databases. Use a single database (called the primary database) for writes. Create: Insert a record in the primary database only. Read: Connect to any database in a round-robin fashion. Update: Update a record in the primary database only. Delete: Delete a record in the primary database only. |
Shards can be very difficult
to implement. Make sure you test thoroughly when implementing shards.
You can also look at some of the shard libraries that have been
developed. The shard library found on CodePlex and explained further in
this article uses .NET 4.0; you can find it with its source code at http://enzosqlshard.codeplex.com. It uses round-robin as its access method. You can also look at another implementation of a shard library that uses SQLAzureHelper; this shard library uses decision rules as its access method and is provided by the SQL Azure Team (http://blogs.msdn.com/b/sqlazure/).
4.2. Read-Only Shards
Shards can be implemented
in multiple ways. For example, you can create a read-only shard (ROS).
Although the shard is fed from a database that accepts read/write
operations, its records are read-only for consumers.
Figure 5
shows an example of a shard topology in which a local SQL
Server stores the data with read and write access. The data is then
replicated using the SQL Data Sync framework (or other method) to the
actual shards, which are additional SQL Azure databases in the cloud.
The consuming application then connects to the shard (in SQL Azure) to
read the information as needed.
In one scenario, the SQL
Azure databases each contain the exact same copy of the data (mirror
shard), so the consumer can connect to one of the SQL Azure databases
(using a round-robin mechanism to spread the load, for example). This
is perhaps the simpler implementation because all the records are
copied to all the databases in the shard blindly. However, keep in mind
that SQL Azure doesn't support distributed transactions; you may need
to have a compensating mechanism in case some transactions commit and
others don't.
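
As a rough illustration of the mirror scenario, the Python sketch below sends writes to one primary database, copies the data to each mirror, and spreads reads across the mirrors in a round-robin manner. It rests on several assumptions: in-memory SQLite databases stand in for the local SQL Server and the SQL Azure databases, the synchronize function is a naive substitute for a service such as SQL Data Sync, and all names (primary, mirrors, create_sale, read_sales) are hypothetical.

```python
import itertools
import sqlite3


def make_db():
    """Create an in-memory database with an empty sales table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL)")
    return conn


# Assumption: SQLite stands in for the local SQL Server and the cloud databases.
primary = make_db()                      # accepts reads and writes
mirrors = [make_db() for _ in range(3)]  # read-only for consumers
_next_mirror = itertools.cycle(mirrors)


def create_sale(sale_id, amount):
    """Writes go to the primary database only."""
    primary.execute("INSERT INTO sales VALUES (?, ?)", (sale_id, amount))


def synchronize():
    """Naive stand-in for a sync service: copy the whole table to each mirror.

    Each mirror is updated in its own transaction. No transaction spans all
    mirrors, so a failure partway through leaves some mirrors stale until
    the next run.
    """
    rows = primary.execute("SELECT id, amount FROM sales").fetchall()
    for mirror in mirrors:
        with mirror:  # commit this mirror's changes, or roll back on error
            mirror.execute("DELETE FROM sales")
            mirror.executemany("INSERT INTO sales VALUES (?, ?)", rows)


def read_sales():
    """Reads go to the next mirror in the rotation; no rules are needed."""
    conn = next(_next_mirror)
    return conn.execute("SELECT id, amount FROM sales ORDER BY id").fetchall()
```

Because each mirror commits separately, rerunning synchronize after a failure is the simplest compensating mechanism in this sketch; a real deployment would likely need something more selective.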
Another
implementation of the ROS consists of synchronizing the data using
horizontal partitioning. In a horizontal partition, rules are applied
to determine which database contains which data. For example, the SQL
Data Sync service can be implemented to partition the data for US
sales to one SQL Azure database and European sales to another. In this
implementation, either the consumer knows about the horizontal
partition and knows which database to connect to (by applying decision
rules based on customer input), or it queries every database in the
cloud and filters with a WHERE clause on
the country, which avoids the cost of building and running a decision
engine that selects the correct database based on the established rules.
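
The two read options for a horizontally partitioned ROS can be compared side by side. This is a minimal Python sketch under stated assumptions: in-memory SQLite databases stand in for the SQL Azure databases, the sample rows are made up for illustration, and the names PARTITIONS, REGION_BY_COUNTRY, read_with_rules, and read_from_all are hypothetical.

```python
import sqlite3


def make_db(rows):
    """Create an in-memory database preloaded with sales rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (id INTEGER, country TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    return conn


# Assumption: one database holds US sales and another holds European sales.
PARTITIONS = {
    "us": make_db([(1, "US", 10.0), (2, "US", 15.0)]),
    "europe": make_db([(3, "FR", 20.0), (4, "DE", 30.0)]),
}
REGION_BY_COUNTRY = {"US": "us", "FR": "europe", "DE": "europe"}

QUERY = "SELECT id, country, amount FROM sales WHERE country = ? ORDER BY id"


def read_with_rules(country):
    """Option 1: apply the decision rule, then query a single database."""
    conn = PARTITIONS[REGION_BY_COUNTRY[country]]
    return conn.execute(QUERY, (country,)).fetchall()


def read_from_all(country):
    """Option 2: skip the rules, query every database, and concatenate."""
    rows = []
    for conn in PARTITIONS.values():
        rows.extend(conn.execute(QUERY, (country,)).fetchall())
    return rows
```

Both functions return the same rows. The first opens one connection but needs the rule table kept in step with the partitioning; the second needs no rules but costs one query per database.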
4.3. Read-Write Shards
In a read-write shard
(RWS), all databases are considered read/write. In this case, you don't
need to use a replication topology that uses the SQL Data Sync
framework because there is a single copy of each record within the
shard. Figure 6 shows an RWS topology.
Although an RWS
removes the complexity of synchronizing data between databases, the
consumer is responsible for directing all CRUD operations to the
appropriate cloud database. This requires special considerations and
advanced development techniques to accomplish, as previously discussed.
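
To make the routing responsibility concrete, here is a small Python sketch of CRUD operations against an RWS that uses a decision rule on the country column. It is a sketch under assumptions rather than a reference implementation: in-memory SQLite databases stand in for the cloud databases, and the names SHARD, database_for, create, read, update_country, and delete are hypothetical. It also shows the case noted in Table 1 where updating the column used by the rule moves the record to another database.

```python
import sqlite3


def make_db():
    """Create an in-memory database with an empty sales table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales (id INTEGER PRIMARY KEY, country TEXT, amount REAL)"
    )
    return conn


# Assumption: SQLite stands in for two read/write cloud databases.
SHARD = {"db_us": make_db(), "db_other": make_db()}


def database_for(country):
    """Decision rule (hypothetical): US records live apart from the rest."""
    return "db_us" if country == "US" else "db_other"


def create(sale_id, country, amount):
    SHARD[database_for(country)].execute(
        "INSERT INTO sales VALUES (?, ?, ?)", (sale_id, country, amount)
    )


def read(sale_id, country):
    return SHARD[database_for(country)].execute(
        "SELECT id, country, amount FROM sales WHERE id = ?", (sale_id,)
    ).fetchone()


def update_country(sale_id, old_country, new_country):
    """Update the rule column, moving the record if the rule now differs."""
    source, target = database_for(old_country), database_for(new_country)
    if source == target:
        SHARD[source].execute(
            "UPDATE sales SET country = ? WHERE id = ?", (new_country, sale_id)
        )
        return
    row = read(sale_id, old_country)
    if row is None:
        raise KeyError(sale_id)
    # Two separate writes, not one distributed transaction: a failure between
    # them leaves the record in both databases and needs compensation.
    SHARD[target].execute(
        "INSERT INTO sales VALUES (?, ?, ?)", (row[0], new_country, row[2])
    )
    SHARD[source].execute("DELETE FROM sales WHERE id = ?", (sale_id,))


def delete(sale_id, country):
    SHARD[database_for(country)].execute(
        "DELETE FROM sales WHERE id = ?", (sale_id,)
    )
```

Inserting into the target before deleting from the source means a failure leaves a duplicate rather than a lost record, which is usually the easier state to repair.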